MATTE: a pipeline of transcriptome module alignment for anti-noise phenotype-gene-related analysis

Abstract A phenotype may be associated with multiple genes that interact with each other in the form of a gene module or network. How to identify these relationships is one important aspect of comparative transcriptomics. However, it is still a challenge to align gene modules associated with different phenotypes. Although several studies attempted to address this issue in different aspects, a general framework is still needed. In this study, we introduce Module Alignment of TranscripTomE (MATTE), a novel approach to analyze transcriptomics data and identify differences in a modular manner. MATTE assumes that gene interactions modulate a phenotype and models phenotype differences as gene location changes. Specifically, we first represented genes by a relative differential expression to reduce the influence of noise in omics data. Meanwhile, clustering and aligning are combined to depict gene differences in a modular way robustly. The results show that MATTE outperformed state-of-the-art methods in identifying differentially expressed genes under noise in gene expression. In particular, MATTE could also deal with single-cell ribonucleic acid-seq data to extract the best cell-type marker genes compared to other methods. Additionally, we demonstrate how MATTE supports the discovery of biologically significant genes and modules, and facilitates downstream analyses to gain insight into breast cancer. The source code of MATTE and case analysis are available at https://github.com/zjupgx/MATTE.

[ , ] represents the concatenation of two matrices and . and represent different phenotypes and genes. { ; } is the gene set that meets the condition.

Ability of anti-noise
In this section, we discuss the hypothesis for why the relative difference removes noise. Consider that gene expression is influenced by three factors: phenotype effects ̂, individual variability , and batch effects , where ∼ (0, ) and ∼ ( , ).
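The intuition can be illustrated with a minimal simulation. The sketch below assumes a purely additive model in which every sample carries one batch shift shared by all of its genes; this is our simplifying assumption, not a statement of the exact model used by MATTE. Under that assumption, the difference between two genes measured in the same sample no longer contains the batch term. All function names and parameter values are hypothetical.

```python
import random
import statistics

def simulate(n_samples=200, batch_mean=1.0, batch_sd=3.0, indiv_sd=0.5, seed=0):
    """Simulate two genes under an assumed additive model:
    expression = phenotype effect + individual variability + batch effect.
    The batch effect is a single shift per sample, shared by both genes
    (an assumption made for this illustration).
    """
    rng = random.Random(seed)
    effect_a, effect_b = 5.0, 2.0  # hypothetical phenotype effects
    gene_a, gene_b = [], []
    for _ in range(n_samples):
        batch = rng.gauss(batch_mean, batch_sd)  # shared within one sample
        gene_a.append(effect_a + rng.gauss(0.0, indiv_sd) + batch)
        gene_b.append(effect_b + rng.gauss(0.0, indiv_sd) + batch)
    return gene_a, gene_b

gene_a, gene_b = simulate()

# Within-sample difference between the two genes: the shared batch term cancels.
relative = [a - b for a, b in zip(gene_a, gene_b)]

var_raw = statistics.pvariance(gene_a)
var_relative = statistics.pvariance(relative)
print(f"variance of gene A alone:    {var_raw:.2f}")
print(f"variance of A relative to B: {var_relative:.2f}")
```

The raw expression of one gene inherits the full batch variance, while the relative value retains only the individual variability of the two genes. If the batch effect were gene-specific rather than shared within a sample, this cancellation would be only partial.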

Inter-individual correlation
In this section, we show how inter-individual correlation represents gene co-expression from the viewpoint of an individual sample. For any two gene expressions and , let where is the standard deviation of a gene, and ‾ and ‾ are the means of the expression. Then the mean of is equal to the Pearson's correlation coefficient .
Thus, can be seen as the sample-resolution co-expression strength of the two genes. In this way, a similar strategy can be used to explore gene pairs whose co-expression differs between two phenotypes.
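The identity stated above can be checked numerically. The sketch below uses notation of our own choosing: for two genes x and y, the per-sample quantity is z_i = (x_i - mean(x)) (y_i - mean(y)) / (sd(x) sd(y)), where sd is the population standard deviation (dividing by n). The expression values are hypothetical.

```python
import math

def per_sample_coexpression(x, y):
    """Return the per-sample products of standardized expression values.
    The population standard deviation (divide by n) is used, which is
    what makes the mean of the products equal Pearson's r.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    return [(a - mean_x) * (b - mean_y) / (sd_x * sd_y) for a, b in zip(x, y)]

def pearson(x, y):
    """Textbook Pearson correlation coefficient, computed independently."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

x = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3]  # hypothetical expression of gene x
y = [1.0, 2.2, 1.1, 4.9, 3.0, 2.7]  # hypothetical expression of gene y

z = per_sample_coexpression(x, y)
mean_z = sum(z) / len(z)
print(mean_z, pearson(x, y))
```

Each element of z belongs to one sample, so the vector can be compared between phenotype groups, whereas Pearson's r alone yields a single number per group.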

Brief descriptions of the compared methods
Differential co-expression. The expected conditional F-statistic (ECF) [1] calculates the F-statistic under the expected condition; its Python implementation follows the R package cosine [2]. The Python implementations of the following three methods follow the R package dcanr [3]. Z-score [4] converts the PCC of gene pairs into statistics of gene triplets. Entropy [5] of the PCC is calculated based on probabilistic graphical models. DiffCoEX [6] constructs a scale-free network as WGCNA does and uses the topological overlap to calculate differential co-expression.

Other methods. Three unsupervised methods are based on the hypothesis that highly variable genes tend to be important; their implementation details differ as follows. Seurat v3 HVG [8] ranks genes after a variance-stabilizing transformation. Cell Ranger [9] and Seurat HVG [8] rank genes by their dispersion within bins defined by mean expression. The implementations of these three unsupervised methods are based on the scanpy Python package [10]. The model-based method extracts the weight of each gene from an SVM model with a linear kernel.
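To illustrate the idea behind the bin-based rankings, the sketch below groups genes by mean expression and standardizes each gene's dispersion within its bin. It is a simplified illustration and not the scanpy, Seurat or Cell Ranger implementation: the use of equal-sized bins, the variance-to-mean ratio as dispersion, and the z-score normalization are assumptions made here, and the function name and toy data are hypothetical.

```python
import statistics

def rank_by_binned_dispersion(expr, n_bins=2):
    """Rank genes from most to least variable relative to genes with
    similar mean expression.

    expr: dict mapping gene name -> list of expression values across cells.
    """
    means = {g: statistics.mean(v) for g, v in expr.items()}
    # Dispersion as variance-to-mean ratio; assumes strictly positive means.
    disp = {g: statistics.pvariance(v) / means[g] for g, v in expr.items()}

    # Split genes into bins of (nearly) equal size by mean expression.
    ordered = sorted(expr, key=means.get)
    size = -(-len(ordered) // n_bins)  # ceiling division
    bins = [ordered[i:i + size] for i in range(0, len(ordered), size)]

    scores = {}
    for genes in bins:
        values = [disp[g] for g in genes]
        center = statistics.mean(values)
        spread = statistics.pstdev(values) or 1.0  # guard against zero spread
        for g in genes:
            scores[g] = (disp[g] - center) / spread
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: three lowly and three highly expressed genes.
expr = {
    "low_flat": [1.0, 1.0, 2.0, 1.0, 1.0, 2.0],
    "low_var": [0.0, 4.0, 0.0, 5.0, 0.0, 3.0],
    "low_mid": [1.0, 3.0, 1.0, 3.0, 1.0, 3.0],
    "high_flat": [50.0, 51.0, 49.0, 50.0, 52.0, 48.0],
    "high_var": [10.0, 90.0, 20.0, 80.0, 30.0, 70.0],
    "high_mid": [40.0, 60.0, 40.0, 60.0, 40.0, 60.0],
}
ranking = rank_by_binned_dispersion(expr, n_bins=2)
print(ranking)
```

Standardizing within bins keeps highly expressed genes from dominating the ranking merely because dispersion tends to change with mean expression.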

Pan-cancer analysis
We collected pan-cancer transcriptomics data and clinical information at . For the pan-cancer analysis, MATTE is first run to obtain the MCs of each cancer type that characterize the cancer (compared to normal samples). Then, each sample is represented by the eigengenes of the MCs whose SNR is above 0.5. For subtyping, agglomerative clustering is performed based on the correlation distance. For the classification of cancer and normal samples, logistic regression is used.
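The subtyping step can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the text does not specify the linkage criterion, so average linkage is assumed here, and the sample names and eigengene values are hypothetical. The correlation distance is taken as one minus the Pearson correlation between two sample profiles.

```python
import math

def correlation_distance(u, v):
    """Return 1 - Pearson correlation between two sample profiles.
    Profiles must not be constant, otherwise the correlation is undefined.
    """
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm = math.sqrt(
        sum((a - mean_u) ** 2 for a in u) * sum((b - mean_v) ** 2 for b in v)
    )
    return 1.0 - cov / norm

def agglomerative(profiles, n_clusters):
    """Average-linkage agglomerative clustering on correlation distance.

    profiles: dict mapping sample name -> list of module eigengene values.
    Returns a list of clusters, each a list of sample names.
    """
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(correlation_distance(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical eigengene values of three modules for four samples.
profiles = {
    "s1": [1.0, 2.0, 3.0],
    "s2": [1.1, 2.1, 2.9],
    "s3": [3.0, 2.0, 1.0],
    "s4": [2.8, 2.2, 0.9],
}
clusters = agglomerative(profiles, n_clusters=2)
print(clusters)
```

Because the correlation distance compares the shape of a profile across modules rather than its magnitude, samples with the same relative pattern of module activity are grouped together even if their overall levels differ.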

Functional enrichment analysis
Functional enrichment analysis is performed with the DAVID web server [11]. Functional annotations include Gene Ontology, KEGG pathway and cytoband information.

Figure S1. Summary of data processing, estimation and analysis in this study.